(54) Method and system for speech recognition using continuous density hidden Markov models



(57) A method and system are provided for achieving improved recognition accuracy in speech recognition systems that utilize continuous density hidden Markov models to represent phonetic units of speech present in spoken utterances. An acoustic score, which reflects the likelihood that a speech utterance matches a modeled linguistic expression, depends on the output probability associated with the states of the hidden Markov model. Context-independent and context-dependent continuous density hidden Markov models are generated for each phonetic unit. The output probability associated with a state is determined by weighting the output probabilities of the context-dependent and context-independent states in accordance with a weighting factor. The weighting factor indicates the robustness of the output probability associated with each state of each model, especially in predicting unseen speech utterances.
Description

Technical Field

The present invention relates to computer speech 
recognition, and more particularly, to a computer 
speech recognition system that utilizes continuous hid- 
den Markov models. 

Background of the Invention 

The area of speech recognition is challenged by the 
need to produce a speaker-independent continuous 
speech recognition system which has a minimal recog- 
nition error rate. The focus in realizing this goal is on the 
recognition algorithm that is utilized by the speech rec- 
ognition system. The recognition algorithm is essen- 
tially a mapping of the speech signal, a continuous-time 
signal, to a set of reference patterns representing the 
phonetic and phonological descriptions of speech previ- 
ously obtained from training data. In order to perform 
this mapping, signal processing techniques such as fast
Fourier transforms (FFT), linear predictive coding (LPC),
or filter banks are applied to a digital form of the speech 
signal to extract an appropriate parametric representa- 
tion of the speech signal. A commonly-used representa- 
tion is a feature vector containing for each time interval, 
the FFT or LPC coefficients that represent the fre- 
quency and/or energy bands contained in the speech 
signal. A sequence of these feature vectors is mapped 
to the set of reference patterns which identify linguistic 
units, words and/or sentences contained in the speech 
signal. 

Often, the speech signal does not exactly match the 
stored reference patterns. The difficulty in finding an 
exact match is due to the great degree of variability in 
speech signal characteristics which are not completely 
and accurately captured by the stored reference patterns.
Probabilistic models and statistical techniques have been used
with more success in predicting the intended message than
techniques that seek an exact match. One such technique uses
hidden Markov models (HMMs). These techniques are better suited
to speech recognition since they determine the reference pattern
most likely to match the speech signal rather than seeking an
exact match.

A HMM consists of a sequence of states connected 
by transitions. A HMM can represent a particular pho- 
netic unit of speech, such as a phoneme or word. Asso- 
ciated with each state is an output probability indicating 
the likelihood that the state matches a feature vector. 
For each transition, there is an associated transition 
probability indicating the likelihood of following the tran- 
sition. The transition and output probabilities are esti- 
mated statistically from previously spoken speech 
patterns, referred to as "training data." The recognition 
problem is one of finding the state sequence having the 
highest probability of matching the feature vectors representing
the input speech signal. Primarily, this search process involves
enumerating every possible state
sequence that has been modeled and determining the 
probability that the state sequence matches the input 
speech signal. The utterance corresponding to the state 
sequence with the highest probability is selected as the 
recognized speech utterance. 
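
The exhaustive search described above can be illustrated with a small sketch. It is a minimal example in Python, assuming a made-up two-state left-to-right model and three feature vectors whose output probabilities are already known; the function name `best_state_sequence` and all numeric values are hypothetical. Practical recognizers generally avoid full enumeration by using dynamic programming such as the Viterbi algorithm, which is not shown here.

```python
from itertools import product


def best_state_sequence(initial, transition, output_probs):
    """Enumerate every state sequence and return the most probable one.

    initial[s]          -- probability of starting in state s
    transition[s][t]    -- probability of following the transition s -> t
    output_probs[n][s]  -- output probability of state s for feature vector n
    """
    states = range(len(initial))
    best_path, best_prob = None, 0.0
    for path in product(states, repeat=len(output_probs)):
        prob = initial[path[0]] * output_probs[0][path[0]]
        for n in range(1, len(path)):
            prob *= transition[path[n - 1]][path[n]] * output_probs[n][path[n]]
        if prob > best_prob:
            best_path, best_prob = path, prob
    return best_path, best_prob


# Illustrative values only: two states, three feature vectors.
initial = [1.0, 0.0]
transition = [[0.6, 0.4], [0.0, 1.0]]
output_probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
path, prob = best_state_sequence(initial, transition, output_probs)
```

The cost of this enumeration grows exponentially with the number of feature vectors, which is why it serves only as an illustration of the search problem.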

Most HMM-based speech recognition systems are 
based on discrete HMMs utilizing vector quantization. A 
discrete HMM has a finite set of output symbols and the 
transition and output probabilities are based on discrete 
probability distribution functions (pdfs). Vector quantiza- 
tion is used to characterize the continuous speech sig- 
nal by a discrete representation referred to as a 
codeword. A feature vector is matched to a codeword 
using a distortion measure. The feature vector is 
replaced by the index of the codeword having the small- 
est distortion measure. The recognition problem is 
reduced to computing the discrete output probability of 
an observed speech signal as a table look-up operation 
which requires minimal computation. 
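
The table look-up described above can be sketched as follows. This is a minimal illustration in Python, assuming a squared Euclidean distortion measure and a tiny made-up codebook and output table; the names `CODEBOOK`, `OUTPUT_PROB`, `quantize`, and `discrete_output_prob` are hypothetical and are not taken from any particular system.

```python
# Hypothetical 2-dimensional codebook with three codewords (illustrative values).
CODEBOOK = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]

# Hypothetical discrete output probabilities: OUTPUT_PROB[state][codeword index].
OUTPUT_PROB = {
    "state_1": [0.7, 0.2, 0.1],
    "state_2": [0.1, 0.3, 0.6],
}


def distortion(x, codeword):
    """Squared Euclidean distance, assumed here as the distortion measure."""
    return sum((a - b) ** 2 for a, b in zip(x, codeword))


def quantize(x):
    """Return the index of the codeword with the smallest distortion."""
    return min(range(len(CODEBOOK)), key=lambda k: distortion(x, CODEBOOK[k]))


def discrete_output_prob(state, x):
    """Output probability of feature vector x in a state, by table look-up."""
    return OUTPUT_PROB[state][quantize(x)]


index = quantize((0.9, 1.2))
prob = discrete_output_prob("state_1", (0.9, 1.2))
```

Once the feature vector has been replaced by a codeword index, the output probability costs a single list access, which is the minimal computation the text refers to.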

However, speech signals are continuous signals. 
Although it is possible to quantize continuous signals
through codewords, there may be serious degradation 
associated with such quantization resulting in poor rec- 
ognition accuracy. Recognition systems utilizing contin- 
uous density HMMs do not suffer from the inaccuracy 
associated with quantization distortion. Continuous 
density HMMs are able to directly model the continuous 
speech signal using estimated continuous density prob- 
ability distribution functions, thereby achieving a higher 
recognition accuracy. However, continuous density 
HMMs require a considerable amount of training data 
and require a longer recognition computation which has 
deterred their use in most commercial speech recogni- 
tion systems. Accordingly, a significant problem in con- 
tinuous speech recognition systems has been the use 
of continuous density HMMs for achieving high recogni- 
tion accuracy. 



The present invention pertains to a speech recognition system which improves the modeling of the speech signal to continuous density HMMs corresponding to a linguistic expression. In the preferred embodiment, the recognition system utilizes a context-independent HMM and several context-dependent HMMs to represent the speech unit of a phoneme in different contextual patterns. The output and transition probabilities for each of these HMMs are estimated from the training data. The output probabilities associated with like states corresponding to the same modeled phoneme are clustered, forming senones. A weighting factor for each context-dependent senone, which indicates the robustness of the output probability in predicting unseen data, is also generated. In the preferred embodiment, the weighting factor is estimated through deleted interpolation of all the data points in the training data. Alternatively, the weighting factor can be estimated from a parametric representation of the data points or from randomly generated data points generated from a parametric representation of the data points.

The recognition engine receives an input speech 
utterance and generates candidate word sequences 
which will most likely match the feature vectors of the 
input speech utterance. The word sequences can be 
composed of various senone alignments corresponding 
to state sequences of HMMs. The recognition engine 
determines which senone/state alignment best matches 
the feature vectors by utilizing an acoustic and language 
probability score. The acoustic probability score repre- 
sents the likelihood that the senone alignment corre- 
sponds to the feature vectors and the language 
probability score indicates the likelihood of the utter- 
ance corresponding to the senone alignment occurring 
in the language. The acoustic probability score is based 
on an output and transition probability analysis. The out- 
put probability analysis utilizes the output probabilities 
of both the context-dependent and context-independent 
senones by weighting each output probability as a function
of the weighting factor. The output probability having
the more robust estimate will dominate the analysis,
thereby improving the output probability analysis. An
improvement in the output probability analysis improves 
the acoustic score and, in turn, the overall recognition 
accuracy. 

Brief Description of the Drawings 

The foregoing and other features and advantages 
of the invention will be apparent from the following more 
particular description of the preferred embodiment of 
the invention, as illustrated in the accompanying draw- 
ings in which like reference characters refer to the same 
elements throughout the different views. The drawings 
are not necessarily to scale, emphasis instead being 
placed upon illustrating the principles of the invention. 

Figure 1 is a block diagram of a speech recognition 
system employed in the preferred embodiment. 

Figure 2 is a flow diagram of a training method used
in the system of Figure 1.

Figure 3 is a flow diagram of the method for calculating
weighting factors used in the system of Figure 1.

Figure 4 is a flow diagram of the preferred embodi- 
ment for calculating a new value for lambda as used in 
the system of Figure 3. 

Figure 5 is a flow diagram of a first alternate 
embodiment for calculating a new value for lambda as 
used in the system of Figure 3. 

Figure 6 is a flow diagram of a second alternate 
embodiment for calculating a new value for lambda as 
used in the system of Figure 3. 

Figures 7A and 7B depict an example of the hidden 
Markov models and senone structures associated with 
a phoneme. 

Figure 8 is a flow diagram of the speech recognition
method used in the system of Figure 1.



Detailed Description of the Invention 

The preferred embodiment of the present invention recognizes that improved recognition accuracy can be obtained in speech recognition systems that employ continuous density hidden Markov models by weighting different output probabilities representing the same phonetic unit relative to the degree to which each output probability can predict unseen data. The speech recognition system of the claimed invention receives an input speech utterance, in the form of a continuous signal, and generates the most likely linguistic expression that corresponds to the utterance. The preferred embodiment recognizes a linguistic expression by matching a set of feature vectors that form a parametric representation of the speech signal to a sequence of hidden Markov models (HMMs) which identify possible linguistic expressions. A HMM may represent a phoneme, and a sequence of HMMs may represent words or sentences composed of phonemes.

Continuous density probability distribution functions, such as a mixture of Gaussian probability distribution functions, can be utilized to represent the output probability of a state, since they are more accurate at modeling a speech signal. The output probability function is statistically estimated from training data. Often there is an insufficient amount of training data to accurately estimate the output probability function. To account for this problem, context-independent and context-dependent models are constructed for a predetermined set of phonemes. The output probabilities of the context-independent model are then interpolated with the output probabilities of the context-dependent model. This is done through a weighting or interpolation factor which estimates the degree to which the output probability function of the context-dependent HMM can predict data not previously encountered in the training data. Thus, the new modified output probability function of a context-dependent state is a combination of the output probability functions of both models, weighted in accordance with the robustness of the estimates. Accordingly, in the preferred embodiment, deleted interpolation is used to smooth the probability space rather than the parameter space.
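
The interpolation described above can be sketched in a few lines. This is a minimal illustration in Python, assuming the combined probability takes the form of the weighting factor times the context-dependent output probability plus one minus the weighting factor times the context-independent output probability, which is the "overall probability" defined later in this description. The function name and the numeric values are hypothetical.

```python
def interpolated_output_prob(b_cd, b_ci, lam):
    """Combine context-dependent (b_cd) and context-independent (b_ci)
    output probabilities using a weighting factor lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("weighting factor must lie between 0 and 1")
    return lam * b_cd + (1.0 - lam) * b_ci


# Illustrative values: a well-trained context-dependent state gets a
# weighting factor near 1, a poorly trained one a factor near 0.
well_trained = interpolated_output_prob(0.30, 0.10, 0.9)
poorly_trained = interpolated_output_prob(0.30, 0.10, 0.1)
```

With a weighting factor near 1 the result stays close to the context-dependent estimate; with a factor near 0 it falls back toward the more robust context-independent estimate.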

Figure 1 illustrates a speech recognition system 10 which can be used to implement the recognition and training processes in accordance with the preferred embodiment of the invention. The speech recognition system 10 contains an input device 12, such as but not limited to a microphone, which receives an input speech utterance and generates a corresponding analog electric signal. Alternatively, a speech utterance that is stored on a storage device can be used as the input speech utterance. The analog electric signal corresponding to the speech utterance is transmitted to analog-to-digital (A/D) converter 14, which converts the analog signal to a sequence of digital samples. The digital samples are then transmitted to a feature extractor 16 which extracts a parametric representation of the digitized input speech signal. This parametric representation captures the acoustic properties of the input speech utterance. Preferably, the feature extractor 16 performs spectral analysis to generate a sequence of feature vectors, each of which contains coefficients representing a spectrum of the input speech signal. Methods for performing the spectral analysis are well-known in the art of signal processing and can include fast Fourier transforms (FFT), linear predictive coding (LPC), and cepstral coefficients, all of which can be utilized by feature extractor 16. Feature extractor 16 can be any conventional processor that performs spectral analysis. Spectral analysis may be performed every ten milliseconds to divide the input speech signal into feature vectors, each of which represents twenty-five milliseconds of the utterance. However, this invention is not limited to using feature vectors that represent twenty-five milliseconds of the utterance. Feature vectors representing different lengths of time of a speech utterance can also be used. This process is repeated for the entire input speech signal and results in a sequence of feature vectors which is transmitted to a data processor 38. Data processor 38 can be any conventional computer, such as a desktop personal computer. The data processor contains a switching block 18 which routes the sequence of feature vectors. Switching block 18 can be implemented in hardware or software. However, the speech recognition system is not limited to executing on a data processor. Other types of executable mediums can be used, such as but not limited to, a computer readable storage medium which can be a memory device, compact disc, or floppy disk.
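
The framing described above, with analysis every ten milliseconds over twenty-five millisecond windows, can be sketched as follows. This is a minimal illustration in Python; the 16 kHz sampling rate, the function name `frame_signal`, and the choice to drop a trailing partial window are assumptions made for the example and are not stated in this description. The spectral analysis applied to each frame is omitted.

```python
def frame_signal(samples, sample_rate, window_ms=25, step_ms=10):
    """Split digital samples into overlapping analysis windows.

    A new window starts every step_ms milliseconds and covers window_ms
    milliseconds of the signal; a trailing partial window is dropped.
    """
    window = int(sample_rate * window_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    frames = []
    start = 0
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += step
    return frames


# One second of silence at an assumed sampling rate of 16 kHz.
samples = [0.0] * 16000
frames = frame_signal(samples, 16000)
```

Each returned frame would then be passed to the spectral analysis to yield one feature vector, so successive feature vectors describe overlapping stretches of the utterance.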

During the initial training phase, switching block 18 is switched to direct the feature vectors to the training engine 20. Training engine 20 utilizes the feature vectors to estimate the parameters of the HMMs which will represent the phonemes present in the training data and to compute a set of weighting factors for use by the recognition engine 34. A more detailed description of the method employed by training engine 20 is presented below with reference to Figures 2-6. Briefly, the training engine 20 generates context-independent and context-dependent phoneme-based hidden Markov models by estimating the parameters for these models from the training data. The output distributions for each context-dependent state are clustered, forming senones which are stored in senone table storage 30. The senone table storage 30, in general, holds senones for both context-dependent and context-independent HMMs. Senone identifiers for each HMM are stored in the HMM storage 28. In addition, a weighting factor for each context-dependent senone is calculated and stored in lambda table storage 26 for use by the recognition engine 34. The lambda table storage 26 holds lambda values indexed by context-dependent HMMs. The training engine 20 also utilizes a text transcript that holds a translation 22 of the training data and a dictionary 24 that contains a phonemic description of each word in order to assure that each word is correctly modeled. A more detailed description of the operation of the training engine 20 will be discussed below. The dictionary 24 contains a pronunciation of each word in terms of phonemes. For example, a dictionary entry for "add" might be "/AE DD/".

After the initial training phase, switching block 18 is 
switched to transmit the feature vectors to recognition 
engine 34. Recognition engine 34 recognizes the 
sequence of feature vectors as a linguistic expression 
composed of phonemes that form words, which, in turn, 
form sentences. A detailed description of the method 
employed by the recognition engine 34 is presented 
below with reference to Figure 8. Recognition engine 34 
utilizes the context-independent and context-dependent 
hidden Markov models stored in HMM storage 28, the 
context-dependent and context-independent senones 
stored in senone table storage 30, the weighting factors 
stored in lambda table storage 26, and a language 
model stored in language model storage 32 and dictionary 24.
The language model storage 32 may specify a
grammar. In the preferred embodiment, the linguistic 
expression which is generated from recognition engine 
34 is displayed to an output device 36, such as a con- 
ventional printer, computer monitor, or the like. How- 
ever, this invention is not limited to displaying the 
linguistic expression to an output device. For example, 
the linguistic expression can be used as input into 
another program or processor for further processing or 
may be stored. 

Figures 2-6 are flow charts that illustrate the steps 
performed in the training phase of the system where the 
parameters of the HMMs and senones are estimated 
and the weighting factors are calculated. In short, the 
training method starts off by receiving input speech 
utterances, in the form of words, sentences, paragraphs,
or the like, and converts them to parametric representations,
known as feature vectors. The structure of
the hidden Markov models and senones is formed and 
estimates of the parameters for these data structures 
are calculated from the training data. The weighting fac- 
tors are then determined through the technique of 
deleted interpolation. 

Referring to Figure 2, the training method commences by receiving a sequence of speech utterances (step 42) which is converted into a sequence of feature vectors (step 44) as previously described above with reference to Figure 1. The complete set of feature vectors is referred to as the "training data." In the preferred embodiment, LPC cepstral analysis is employed to model the speech signal and results in a feature vector that contains the following 39 cepstral and energy coefficients representing the frequency and energy spectra contained in the signal: (1) 12 LPC mel-frequency cepstral coefficients, $x_k(t)$, for $1 \le k \le 12$; (2) 12 LPC delta mel-frequency cepstral coefficients, $\Delta x_k(t)$, for $1 \le k \le 12$; (3) 12 LPC delta-delta mel-frequency cepstral coefficients, $\Delta\Delta x_k(t)$, for $1 \le k \le 12$; and (4) energy, delta energy, and delta-delta energy coefficients. The use of LPC cepstral analysis to model speech signals is well known in the art of speech recognition systems.
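
The layout of the 39-coefficient feature vector can be sketched as follows. This is a minimal illustration in Python, assuming the 12 cepstral coefficients and the energy value for each frame are already available; the delta scheme used here (the difference between the following and the preceding frame, clamped at the ends) is an assumption for the example, since this description does not specify how the delta coefficients are computed. The function names are hypothetical.

```python
def delta(sequence):
    """Difference between the next and the previous frame, clamped at the
    ends of the sequence (an assumed delta scheme, for illustration)."""
    n = len(sequence)
    return [
        [b - a for a, b in zip(sequence[max(t - 1, 0)], sequence[min(t + 1, n - 1)])]
        for t in range(n)
    ]


def build_feature_vectors(cepstra, energy):
    """cepstra: one list of 12 coefficients per frame; energy: one float per frame.

    Returns one 39-element vector per frame in the order listed in the text:
    12 cepstra, 12 delta, 12 delta-delta, energy, delta energy, delta-delta energy.
    """
    d_cep = delta(cepstra)
    dd_cep = delta(d_cep)
    e = [[value] for value in energy]
    d_e = delta(e)
    dd_e = delta(d_e)
    return [
        cepstra[t] + d_cep[t] + dd_cep[t] + e[t] + d_e[t] + dd_e[t]
        for t in range(len(cepstra))
    ]


# Illustrative input: five frames of made-up coefficients.
cepstra = [[float(t + k) for k in range(12)] for t in range(5)]
energy = [float(t) for t in range(5)]
features = build_feature_vectors(cepstra, energy)
```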

In step 46, senone and HMM data structures are 
generated. Senones are a well-known data structure in
speech recognition systems and a detailed description 
of senones and the method used to construct them can 
be found in M. Huang et al., "Predicting Unseen Tri- 
phones with Senones," Proc. ICASSP '93 Vol. II, pp 
311-314, 1993. In the preferred embodiment, a HMM 
can be used to model the speech unit of a phoneme. 
The HMM can also be referred to as an acoustic model. 
This speech unit is selected in order to accommodate 
large-vocabulary recognition. Modeling individual words 
requires a longer training period and additional storage 
to store the associated parameters. This is feasible for 
small vocabulary systems but impractical for those 
which utilize large vocabularies. However, this invention 
is not limited to phoneme-based HMMs. Other speech 
units, such as words, diphones, and syllables can be 
used as the basis of the HMMs. 

Two types of HMMs can be utilized. A context- 
dependent HMM can be used to model a phoneme with 
its left and right phonemic contexts. This type of model 
captures the contextual dependencies which are usu- 
ally present in word modeling. A context-independent 
HMM can be used to model the phoneme in any context 
that it appears in the training data, therefore making it 
independent of any particular context. Predetermined 
patterns consisting of a set of phonemes and their asso- 
ciated left and right phonemic contexts are selected to 
be modeled by the context-dependent HMM. These 
selected patterns represent the most frequently occur- 
ring phonemes and the most frequently occurring con- 
texts of these phonemes. The training data will provide 
the estimates for the parameters of these models. The 
context-independent models will be based on the 
selected phonemes and modeled within whatever pho- 
nemic context appears in the training data. Similarly, the 
training data will provide the estimates for the parame- 
ters of the context-independent models. 

The use of both a context-independent and context- 
dependent model is beneficial in achieving an improved 
recognition accuracy. Each model's robustness is 
related to the amount of training data used to estimate 
its parameters which also enables it to predict data not 
present in the training data. The combination of the two 
models provides a more robust estimate benefiting from 
the training of both models. For example, the context- 
dependent model is beneficial at modeling co-articula- 
tory effects but may be poorly trained due to limited 
training data. (Although a speaker may try to pronounce words as concatenated sequences of phones, the speaker's articulator cannot move instantaneously to produce unaffected phones. As a result, a phone is strongly influenced by the phone that precedes it and the phone that follows it in a word. These effects are "coarticulatory effects.") By contrast, the context-independent model is highly trainable, thereby producing more robust estimates which are less detailed. The combination of these two models, weighted in the appropriate manner, can be used by the recognition engine to produce a more accurate acoustic probability score.

Further, to account for between-speaker differences, such as the formant frequencies (i.e., vocal tract resonance frequencies) present in the male and female vocal tracts, the HMM can utilize a mixture of unimodal distributions for the output probability distribution functions (referred to throughout this application as "output pdf"). Preferably, a mixture of Gaussian probability density functions can be used. However, this invention is not constrained to this particular limitation. Mixtures of other well-known continuous density functions can be used, such as the Laplacian and $K_0$-type density functions.

Further, to capture similarity between states of different context-dependent phonemes and to increase the amount of training data available for each senone, the output distributions of like states of different context-dependent phonetic HMM models for the same context-independent phone are clustered together, forming senones.

Figure 7A illustrates an example of a context-independent HMM structure for the phoneme /aa/ 114. The context-independent HMM includes three states, denoted as state 1 (111), state 2 (112), and state 3 (113). The HMM depicted in Figure 7A models the phoneme /aa/ with any left and right phonemes that appear in the training data as designated by the notation (*,*) in Figure 7A. The first position within the parentheses designates the phoneme that precedes the given phoneme and the second position designates the phoneme that follows the given phoneme. Senones are classified within like states (e.g., state 1) for each type of model (e.g., context-dependent vs. context-independent) corresponding to the same phoneme. In this example, the context-independent HMM has senones 10, 55, and 125 corresponding to states 1, 2 and 3 respectively.

Figure 7B shows an example of corresponding context-dependent HMMs for the phoneme /aa/. In Figure 7B there are five context-dependent models which model the phoneme /aa/ in five distinct phonemic contexts (115-119). For example, the context-dependent model /aa/ (/dh/, /b/) 115 models the phoneme /aa/ in a context where the left or preceding phoneme is /dh/ and where the phoneme /b/ succeeds or is to the right of it. The senones are classified within like states in the different HMMs. In state 1, there are two context-dependent senones, denoted as senones 14 and 35. Overall, for the phoneme /aa/, there are 2 context-dependent senones 14 and 35 and 1 context-independent senone 10 for state 1; 2 context-dependent senones 25 and 85 and 1 context-independent senone 55 for state 2; and 1 context-dependent senone 99 and 1 context-independent senone 125 for state 3.
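
The senone grouping listed above for the phoneme /aa/ can be captured in a small look-up structure. This is a minimal sketch in Python; the dictionary layout and the function name `context_independent_senone` are hypothetical, and only the senone numbers come from the example above. The assignment of individual context-dependent models 115-119 to senones is not given in the text, so it is not represented.

```python
# Senone identifiers for phoneme /aa/, grouped by state as listed in the text.
SENONE_TABLE = {
    "aa": {
        1: {"ci": 10, "cd": [14, 35]},
        2: {"ci": 55, "cd": [25, 85]},
        3: {"ci": 125, "cd": [99]},
    }
}


def context_independent_senone(cd_senone):
    """Find the context-independent senone that shares a phoneme and state
    with the given context-dependent senone."""
    for states in SENONE_TABLE.values():
        for entry in states.values():
            if cd_senone in entry["cd"]:
                return entry["ci"]
    raise KeyError(cd_senone)


ci_for_85 = context_independent_senone(85)
ci_for_14 = context_independent_senone(14)
```

A look-up of this kind is what pairs each context-dependent senone with its context-independent counterpart when the two output probabilities are interpolated.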

Accordingly, the phoneme-based continuous density HMM used in the preferred embodiment can be characterized by the following mathematical definition:

(1) $N$, the number of states in the model; preferably, three states are employed. However, the invention is not restricted to three; as many as five may alternatively be used.

(2) $M$, the number of mixtures in the output pdf.

(3) $A = \{a_{ij}\}$, the state transition probability distribution, from state $i$ to state $j$.

(4) $B = \{b_i(x)\}$, the output probability distribution, the probability of emitting feature vector $x$ when in state $i$, where

$$b_i(x) = \sum_{k=1}^{M} c_k \, N(x, \mu_k, V_k) \tag{1}$$

where $N(x, \mu_k, V_k)$ denotes the multi-dimensional Gaussian density function defined by mean vector $\mu_k$ and covariance matrix $V_k$;

the number $M$ of mixture components is typically anywhere from 1 to 50; and

$c_k$ is the weight for the $k$th mixture component in state $i$.

The output probability distribution associated with each state $i$ is represented by senone $sd_i$ and can be denoted as $p(x \mid sd_i)$.

(5) $\pi = \{\pi_i\}$, the initial state distribution.

For convenience, the compact notation $a = (A, B, \pi)$ is used to denote the complete parameter set of the model, which is otherwise known as the parameter space of a HMM.
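
The mixture output pdf of equation (1) can be evaluated as in the following sketch. It is a minimal illustration in Python that assumes diagonal covariance matrices for simplicity, which is a restriction introduced for the example; equation (1) itself allows a general covariance matrix. The function names and the numeric values are hypothetical.

```python
import math


def gaussian_diag(x, mean, var):
    """Multi-dimensional Gaussian density with a diagonal covariance matrix
    (var holds the diagonal entries). Computed via the log for stability."""
    log_density = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_density += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_density)


def output_prob(x, weights, means, variances):
    """Equation (1): weighted sum of the M mixture component densities."""
    return sum(
        c * gaussian_diag(x, m, v) for c, m, v in zip(weights, means, variances)
    )


# Illustrative two-component mixture in two dimensions.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [2.0, 2.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
b = output_prob([0.0, 0.0], weights, means, variances)
```

In a real system the feature vector would have 39 dimensions and the density values would be very small, so implementations commonly keep the computation in the log domain throughout.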

In step 48 of Figure 2, the parameters for the 
senone, context-dependent HMM, and the context-inde- 
pendent HMM are estimated. The training phase of a 
HMM consists of estimating these parameters using the 
training data, a text of the speech 22, and a dictionary of 
phonemic spellings of words 24. The output and transi- 
tion probabilities can be estimated by the well-known 
Baum-Welch or forward-backward algorithm. The 
Baum-Welch algorithm is preferred since it makes bet- 
ter use of training data. It is described in Huang et al., 
Hidden Markov Models For Speech Recognition, Edinburgh
University Press, 1990. However, this invention is
not limited to this particular training algorithm; others
may be utilized. Normally, about five iterations through
the training data can be made to obtain good estimates 
of the parameters. 

In step 50 of Figure 2, weighting or interpolation factors for each context-dependent senone are generated and are denoted by the mathematical symbol $\lambda$. The weighting factors will be used to interpolate the output probabilities of the context-independent HMM with the output probabilities of the context-dependent HMM. The weighting factors indicate the adeptness of the context-dependent output pdf at predicting unseen data. The output pdf is estimated with training data and will closely predict data which resembles the training data. However, it is impossible to estimate the output pdf with training data that represents every possible input speech utterance, or with sufficient training data for it to predict all unseen data correctly. The role of the weighting factor is to indicate the adeptness of the output pdf at predicting unseen data, which is a function of the training data used to estimate the context-dependent and context-independent models. As the amount of training data for the context-dependent models gets large, $\lambda$ will approach 1.0 and the context-dependent output pdf will be heavily weighted. With a small amount of training data for the context-dependent model, $\lambda$ will approach 0.0 and the context-dependent output pdf will be weighted less. The optimal value of $\lambda$ for each context-dependent senone is determined by deleted interpolation.

Briefly, the technique of deleted interpolation partitions the training data into two distinct sets. One set is used to estimate the parameters of the model and a second set is used to determine the weighting factor, which indicates how well the output pdf can predict unseen data. This process is iterative: at each iteration the sets are rotated and a new model and weighting factor are produced. At the end of all the iterations, the average value of the weighting factor is calculated and used in the recognition phase.

Figures 3-6 illustrate the steps used in computing the weighting factors. Referring to Figure 3, the training data is partitioned into K blocks in step 60. Preferably, there are two blocks of data. However, the invention is not limited to this number of blocks; others may be used depending on the constraints of training data storage and training time.
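
The partitioning and rotation of blocks can be sketched as follows. This is a minimal illustration in Python; the interleaved assignment of data points to blocks and the names `partition` and `rotations` are assumptions made for the example, since this description does not say how the training data is divided among the K blocks.

```python
def partition(data, k=2):
    """Split the training data into k blocks (interleaved, as an assumption)."""
    return [data[j::k] for j in range(k)]


def rotations(blocks):
    """Yield (deleted_block, remaining_data) once for each block, so that
    every block serves as the deleted block exactly once."""
    for j, deleted in enumerate(blocks):
        remaining = [x for i, block in enumerate(blocks) if i != j for x in block]
        yield deleted, remaining


# Illustrative training data: six placeholder data points, two blocks.
blocks = partition(list(range(6)), k=2)
splits = list(rotations(blocks))
```

For each split, the models would be estimated from `remaining` and the weighting factor would be evaluated on `deleted`.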

A weighting factor is calculated for each context-dependent senone (step 62) by first finding $sen_{SI}$, the context-independent senone that corresponds to $sen_{SD}$ (i.e., the context-dependent senone), using the senone table (step 63). The calculation is derived through an iterative process, step 64, which converges when the difference between the current value of $\lambda$ and the new value, denoted as $\lambda_{new}$, meets a certain threshold. Preferably, the process converges or finishes when $|\lambda - \lambda_{new}| < 0.0001$. The process commences by selecting an initial value for $\lambda$, step 66. For the first iteration of a senone, an initial value is preselected by the user. Preferably, the initial value can be an estimated guess such as 0.8. For all other iterations, the initial value will be the previously calculated new value, $\lambda = \lambda_{new}$. In step 68, the process iterates K times. At each iteration, one block of data is selected as the deleted block, and the selected deleted block is one that was not chosen previously, step 70.

The process then proceeds to estimate the output probabilities for each context-dependent (denoted as $b_1$) and context-independent (denoted as $b_2$) senone using the training data from all the blocks except the deleted block (step 72). These parameters are estimated using the same technique as described above in reference to the estimation of the parameters of the HMMs in the training phase (i.e., Baum-Welch algorithm).

Next, in step 74, a new value, $\lambda_{new}$, is computed. The computation assumes that "forced alignment" is used. During training, if the Viterbi algorithm is used, each feature vector in the training data can be identified with a specific senone. This mapping of vectors to senones is known as "forced alignment." $\lambda_{new}$ is calculated in accord with the following mathematical definition:



/-r 



(X*bUxi)) 



fc1(x/)+(1-A>62(x,.)) 



(2) 



where 



N 



f>iW 



= number of data points or feature vectors in 

the deleted block that corresponds to senone 

sen SD using forced alignment 

= feature vector i,\<i<N 

= context-dependent output pdf as defined 

by equation (1) above 

= content-independent output pdf as defined 
by equation (1) above 

X*b^{xD+(\-xyb 2 (xD : referred to as 
the overall probability. 
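
By way of illustration only, the estimation loop of steps 64 to 76 can be sketched in Python: equation (2) is evaluated on each deleted block, the K per-block values are combined in step 76 as an average weighted by the number of aligned points in each block, and the loop repeats until |λ - λ_new| < 0.0001. The sketch assumes that forced alignment has already grouped the feature vectors of each block by senone and that b_1 and b_2 have been re-estimated with each block deleted. The function names, the scalar feature values, and the `max_rounds` guard are assumptions of the sketch, not part of the described system.

```python
from typing import Callable, List, Sequence, Tuple

Pdf = Callable[[float], float]  # stand-in for an output pdf such as b_1 or b_2


def lambda_for_block(lam: float, points: Sequence[float], b1: Pdf, b2: Pdf) -> float:
    """Equation (2): average share of the context-dependent term in the
    overall probability, taken over the points of one deleted block."""
    total = 0.0
    for x in points:
        cd = lam * b1(x)
        overall = cd + (1.0 - lam) * b2(x)  # assumed to be non-zero
        total += cd / overall
    return total / len(points)


def estimate_lambda(
    blocks: List[Sequence[float]],
    models: List[Tuple[Pdf, Pdf]],
    lam: float = 0.8,
    tol: float = 1e-4,
    max_rounds: int = 1000,
) -> float:
    """blocks[j]: points of block j aligned with the senone.
    models[j]: (b1, b2) estimated from every block except block j.
    At least one block is assumed to contain aligned points."""
    for _ in range(max_rounds):
        weighted_sum = 0.0
        n_points = 0
        for points, (b1, b2) in zip(blocks, models):
            if not points:
                continue  # no aligned points in this deleted block
            weighted_sum += lambda_for_block(lam, points, b1, b2) * len(points)
            n_points += len(points)
        new_lam = weighted_sum / n_points  # count-weighted average over blocks
        if abs(lam - new_lam) < tol:
            return new_lam
        lam = new_lam
    return lam
```

When the two pdfs agree on the held-out points, the value stays at its starting point; when the context-dependent pdf explains the held-out points better, the value moves toward 1.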



A value of λ_new is determined for each of the K iterations.
At the completion of all K iterations, in step 76, an average
value is computed, which can be represented by the following
mathematical expression:

$$\lambda_{new} = \frac{\sum_{j=1}^{K} \lambda_{new}^{j} \, N_j}{\sum_{j=1}^{K} N_j} \tag{3}$$

where

j = index of deleted block;
K = number of blocks;
λ_new^j = estimate of λ using deleted block j; and
N_j = number of points in deleted block j that correspond to
sen_SD using forced alignment.

Steps 66 through 76 are executed again if the value of λ_new
does not meet the prescribed threshold. When the process
converges for a particular context-dependent senone, the current
value of λ_new is stored in lambda table 26 for the particular
context-dependent senone.

Figure 4 depicts a flowchart of the steps used in computing the
new value for the weighting factor, λ_new, in accord with
equations (2) and (3) above. The new value is computed by
summing the contribution of the context-dependent output pdf
relative to the overall probability for each data point in the
deleted block. Thus, in step 79, all points in the deleted block
that correspond to sen_SD are found using the model generated in
step 48 and forced alignment. In step 80, the process iterates
for each data point x_i in the deleted block that is aligned
with sen_SD. The contribution of the context-dependent output
pdf for data point x_i relative to the overall probability is
determined in step 82 in accord with the following mathematical
definition:

$$\frac{\lambda \, b_1(x_i)}{\lambda \, b_1(x_i) + (1-\lambda)\, b_2(x_i)} \tag{4}$$

The contributions for all data points computed thus far are
totaled in step 84. At the completion of the iteration, when all
the data points in the deleted block that are aligned with
sen_SD have been processed, the average of the contributions,
λ_new, is computed in step 86 in accord with equation (2) above.

The above computation of the weighting factor uses the data
points in the deleted block. This produces a more accurate
computation at the expense of increasing the training time as
well as the amount of storage needed by the training engine to
perform the computations. In some instances, it may be more
advantageous to generate a parametric representation of the data
points in the deleted block that correspond to sen_SD and to use
the appropriate parameters instead. Another alternative is to
use data points reconstructed from the parametric representation
of the data points. These alternatives provide a coarse
approximation of the data points but have the advantage of
computational efficiency.

Figures 5 and 6 depict these alternate embodiments for the
calculation of the weighting factors. Figure 5 depicts the first
alternate embodiment. Referring to Figure 5, a parametric
representation for the data points in the deleted block is
generated as shown in step 90. In this case, the parametric
representation is a mixture of Gaussians. This representation
can be made using the Baum-Welch algorithm as described above.
The parameters generated include the mean, μ_j, and weight, c_j,
for each mixture component j. The computation of the new value
of lambda, λ_new, can be made in accord with the following
mathematical definition for a deleted block:

$$\lambda_{new} = \sum_{j=1}^{M} \frac{c_j \, \lambda \, b_1(\mu_j)}{\lambda \, b_1(\mu_j) + (1-\lambda)\, b_2(\mu_j)} \tag{5}$$

where

M = number of normal mixture components;
c_j = weight of the jth normal mixture component; note that
$\sum_{j=1}^{M} c_j = 1$; and
μ_j = mean of the jth normal mixture component.


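As a hedged illustration of equation (5), the sum over data points can be replaced by a weighted sum over the mixture means. The Python sketch below assumes scalar means and treats the pdfs as plain callables; the function name `lambda_from_mixture` is hypothetical.

```python
from typing import Callable, Sequence


def lambda_from_mixture(
    lam: float,
    weights: Sequence[float],
    means: Sequence[float],
    b1: Callable[[float], float],
    b2: Callable[[float], float],
) -> float:
    """Equation (5): weighted sum over mixture components of the
    context-dependent share, evaluated at each component mean."""
    total = 0.0
    for c_j, mu_j in zip(weights, means):
        cd = lam * b1(mu_j)
        # Per-component contribution, as in equation (6).
        total += c_j * cd / (cd + (1.0 - lam) * b2(mu_j))
    return total
```

Because the weights sum to 1, the result stays between 0 and 1 whenever both pdfs are non-negative and the overall probability is non-zero at every mean.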

Steps 92 - 98 perform this calculation in the following manner.
Step 92 iterates for each mixture component and determines the
contribution of the context-dependent output probability over
the overall probability for the mixture component having the
corresponding mean and weight parameters. For mixture component
j, this is represented mathematically as:

$$\frac{c_j \, \lambda \, b_1(\mu_j)}{\lambda \, b_1(\mu_j) + (1-\lambda)\, b_2(\mu_j)} \tag{6}$$

In step 96, the sum of these contributions is formed for all the
mixture components. In step 98, the final sum resulting from
step 96 is stored as the value of λ_new for the current sen_SD
and the deleted block.

Referring to Figure 3, at the completion of the K iterations,
the process proceeds to calculate the average value for λ_new in
step 76 in accord with equation (3) above. The process continues
as described above with reference to Figure 3 until the process
converges and the current average value λ_new is stored in the
lambda table 26 for the particular context-dependent senone.

In the second alternative embodiment of the calculation of the
weighting factors, a select number of data points are used,
which are randomly generated from a parametric representation
for the senone. Figure 6 depicts this second alternate
embodiment, which can be described mathematically for a deleted
block according to equation (2) set forth above, except that
{x_i} = generated data points and N = the number of generated
data points.

This alternative embodiment differs from the preferred
embodiment, shown in Figure 3, in the determination of the new
value λ_new (step 74). The flow sequence remains as shown in
Figure 3. Referring to Figure 6, in step 100, a parametric
representation is generated for the data points in the deleted
block. The parametric representation can consist of a mixture of
Gaussians. This parametric representation can be derived using
the Baum-Welch algorithm on the training data in the deleted
block. From this parametric representation a prescribed number
of data points is reconstructed using a random number generator
with the mean and weight parameters, as shown in step 102. The
number of data points that are reconstructed is a trade-off
between the desired accuracy of λ_new and the computational
requirements. A higher number of data points improves the
accuracy of λ_new at the cost of greater computational
requirements. A suitable number of reconstructed data points per
mixture is 100.

In step 104, steps 106 and 108 are performed for each data point
in the set. In step 106, the contribution of the
context-dependent output probability over the overall
probability for the data point is determined. This can be
represented mathematically as:

$$\frac{\lambda \, b_1(x_i)}{\lambda \, b_1(x_i) + (1-\lambda)\, b_2(x_i)} \tag{9}$$

In step 108, the sum of these contributions is formed for all
the data points in the set. At the completion of the iteration
through all the data points in the set, the average of all of
the contributions is returned as the value of λ_new (step 110).
Referring to Figure 3, at the completion of the K iterations,
the process proceeds to calculate the average value for λ_new in
step 76 in accord with equation (3) above. The process continues
as described above with reference to Figure 3 until the process
converges and the current average value λ_new is stored in the
lambda table 26 for the particular context-dependent senone.
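
The second alternative can be sketched roughly as follows. The text names only the mean and weight parameters, so several details here are assumptions made for illustration: features are scalars, each component has a standard deviation, a component is drawn in proportion to its weight, and the total number of points is 100 times the number of components. The function names are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def reconstruct_points(
    weights: Sequence[float],
    means: Sequence[float],
    stddevs: Sequence[float],  # assumed; the text names only mean and weight
    per_mixture: int = 100,
    seed: int = 0,
) -> List[float]:
    """Draw points from a 1-D Gaussian mixture (step 102)."""
    rng = random.Random(seed)
    n = per_mixture * len(weights)
    # Pick a component for each point in proportion to its weight.
    components = rng.choices(range(len(weights)), weights=weights, k=n)
    return [rng.gauss(means[j], stddevs[j]) for j in components]


def lambda_from_points(
    lam: float,
    points: Sequence[float],
    b1: Callable[[float], float],
    b2: Callable[[float], float],
) -> float:
    """Steps 104 to 110: average the context-dependent share over the
    reconstructed points, as in equation (2)."""
    total = 0.0
    for x in points:
        cd = lam * b1(x)
        total += cd / (cd + (1.0 - lam) * b2(x))
    return total / len(points)
```

Raising `per_mixture` trades computation for a less noisy estimate, which mirrors the trade-off described for step 102.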

Once the training data has been generated and stored in the
appropriate storage locations, the recognition system is ready
to execute. The primary task of the speech recognition system is
to detect the linguistic message which is embodied in the input
speech signal. This task is a multi-level decoding problem since
it requires matching the sequence of feature vectors to a
sequence of phonemes, matching the sequence of phonemes to a
sequence of words, and matching the sequence of words to a
sentence. This is performed by forming all possible linguistic
expressions that have been modeled and calculating the
probability that each expression matches the sequence of feature
vectors. Since a linguistic expression is composed of a sequence
of phonemes, the determination can involve calculating the
likelihood that the phonemes forming the expression match the
feature vectors and that the expression is likely to occur
(i.e., is grammatically correct). The probability that the
phonemes forming an expression match the feature vectors can be
referred to as the acoustic score, and the probability that the
expression occurs can be referred to as the language score. The
language score takes into consideration the syntax and semantics
of the language, such as the grammar of the language, and
indicates whether the sequence of words corresponding to the
sequence of phonemes forms a grammatically correct linguistic
expression.

In the preferred embodiment, phonemes are represented by HMMs
where the output pdfs of like states are clustered to form
senones. The process of matching a feature vector to a phoneme
then entails matching a feature vector to the senones associated
with the states of an HMM representing the phoneme. Thus, the
linguistic expression can be composed of senones corresponding
to states of a sequence of HMMs.

In the preferred embodiment of the invention, the task of the
recognition engine can be one of finding the word sequence W
which maximizes the probability P(W|X). The probability P(W|X)
represents the probability of the linguistic expression W
occurring given the input speech signal X. W can be a word
string denoted as W = w_1, w_2, ..., w_n, where w_i denotes
individual words, each word is represented by a sequence of
phonemes, w_i = p_1, p_2, ..., p_q, and X is the input speech
signal represented by a sequence of feature vectors, denoted as
X = x_1, x_2, ..., x_n. This maximization problem can be solved
using a modified version of the well-known Bayes formula, which
is described mathematically as:

$$P(W \mid X) = \frac{P(X \mid W)\, P(W)}{P(X)} \tag{10}$$

P(X|W) is the probability that the input speech signal X matches
the word string W, and is referred to as the acoustic score.
P(W) is the probability that the word string W will occur, and
is referred to as the language score. Since P(X) is independent
of W, maximizing P(W|X) is equivalent to maximizing the
numerator, namely P(X|W)P(W), over all word sequences W.
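
A minimal sketch of this maximization, assuming the acoustic and language scores for each candidate word sequence are already available as positive probabilities. The function name and the dictionary layout are illustrative assumptions; logarithms are used only to avoid underflow when multiplying small probabilities.

```python
import math
from typing import Dict, Tuple

WordSeq = Tuple[str, ...]


def best_word_sequence(scores: Dict[WordSeq, Tuple[float, float]]) -> WordSeq:
    """scores maps a word sequence W to (P(X|W), P(W)), both > 0.
    Returns the W that maximizes P(X|W) * P(W); P(X) is ignored
    because it does not depend on W."""
    def log_numerator(w: WordSeq) -> float:
        acoustic, language = scores[w]
        return math.log(acoustic) + math.log(language)

    return max(scores, key=log_numerator)
```

For example, a candidate with a slightly lower acoustic score can still be selected when its language score is much higher.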

The recognition task considers various word sequences in trying
to determine the best match. For each word sequence that is
considered by the recognition task, it computes an acoustic
score and a language score. A language score indicates how
likely the word sequence is in the language, and is indicated by
the term P(W) in the above equation (10). An acoustic score
indicates how well the sequence of acoustic feature vectors
matches the acoustic model for the word sequence W. The acoustic
score is indicated by the term P(X|W) in the above formula.

In computing the acoustic score for a given word sequence, the
recognition task considers various senone alignments. A senone
alignment is a mapping from the sequence of acoustic feature
vectors to senones which assigns a unique senone to each
acoustic feature vector. Only the senone alignments which would
result in the word sequence under consideration are considered
by the recognition task. An acoustic score for the word sequence
under the constraints of each senone alignment is computed. The
acoustic score for the word sequence is the best acoustic score
over all possible senone alignments.

Mathematically, this can be expressed as:

$$P(X \mid W) = \max_{1 \le i \le Q} P(X \mid (W, A_i))$$

where

A_1, ..., A_Q = all possible senone alignments for word
sequence W.

The computation of the acoustic score for word sequence W under
the constraint of a given senone alignment A can be further
expressed as:

$$P(X \mid (W, A)) = \left( \prod_{i=1}^{n} P(x_i \mid sd_i) \right) P(A)$$

where senone alignment A aligns or maps the ith acoustic feature
vector x_i to the context-dependent senone sd_i. P(A) represents
the state transition probability for the senone sequence
sd_1, ..., sd_n. P(x_i|sd_i) represents the probability that
feature vector x_i matches context-dependent senone sd_i.
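
The two expressions above can be illustrated with a short Python sketch that scores each candidate alignment and keeps the best one. It assumes the candidate alignments and their transition probabilities P(A) are supplied by the caller; an exhaustive loop is used here for clarity and is not a claim about how the search is carried out in the described system. All names are hypothetical.

```python
import math
from typing import Any, Callable, Sequence, Tuple


def alignment_log_score(
    vectors: Sequence[Any],
    alignment: Sequence[Any],
    output_prob: Callable[[Any, Any], float],
    transition_prob: float,
) -> float:
    """log of (product over i of P(x_i | sd_i)) * P(A) for one alignment."""
    log_p = math.log(transition_prob)
    for x, sd in zip(vectors, alignment):
        log_p += math.log(output_prob(x, sd))
    return log_p


def acoustic_log_score(
    vectors: Sequence[Any],
    alignments: Sequence[Tuple[Sequence[Any], float]],
    output_prob: Callable[[Any, Any], float],
) -> float:
    """Best score over the supplied (senone sequence, P(A)) pairs."""
    return max(
        alignment_log_score(vectors, senones, output_prob, p_a)
        for senones, p_a in alignments
    )
```

Here `output_prob(x, sd)` stands for P(x|sd) and must return a positive value for every pair it is asked about.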

The essence of the acoustic score is the computation of the
output probability p(x|sd). This represents the likelihood that
the feature vector, x, matches the senone, sd, which corresponds
to a context-dependent HMM state. However, a poorly estimated
output pdf can contribute to inaccuracies in the computation of
the acoustic score. This usually occurs due to insufficient
training data. The robustness of the distribution increases with
the use of more training data to estimate the output pdf.

One way to reduce this problem is to utilize several HMMs which
model the same phonemes at several levels of detail. The output
pdf for a particular state can then be constructed by combining
the output pdfs at various levels of detail. The combination is
done based on the ability to predict data not seen during
training. A robust output pdf which is more adept at predicting
unseen data will receive a higher weight, while a poorly
estimated output pdf will receive a lower weight in the combined
output pdf. In the preferred embodiment, several
context-dependent HMMs and a context-independent HMM are
utilized to model a phoneme. A weighting factor, λ, for each
senone corresponding to a context-dependent state, which was
computed previously in the training phase, is used to indicate
the weight each senone is given. The larger λ is (approaching
1.0), the more the context-dependent senone dominates and the
less the context-independent senone is weighted. When λ is small
(approaching 0.0), the context-independent senone dominates.
Thus, the computation of the output probability, p(x|sd), can be
represented by the following mathematical definition:

$$p(x \mid sd) = \lambda \, p(x \mid sd_d) + (1-\lambda)\, p(x \mid sd_i) \tag{12}$$

where

λ is the weighting factor between 0 and 1 for senone sd;
x is the feature vector;
sd_d is the senone associated with a state of a
context-dependent HMM;
sd_i is the senone associated with the corresponding state of a
context-independent HMM;
p(x|sd_d) is the probability of the feature vector x matching
senone sd_d; and
p(x|sd_i) is the probability of the feature vector x matching
senone sd_i.
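
Equation (12) amounts to a one-line interpolation. The sketch below is illustrative only; the function name and the range check are assumptions.

```python
def interpolated_output_prob(lam: float, p_cd: float, p_ci: float) -> float:
    """Equation (12): lam weights the context-dependent probability p_cd,
    and (1 - lam) weights the context-independent probability p_ci."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("weighting factor must lie between 0 and 1")
    return lam * p_cd + (1.0 - lam) * p_ci
```

With a weighting factor of 1.0 the result is the context-dependent probability alone; with 0.0 it is the context-independent probability alone.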

Thus, the output probability, p(x|sd), is linearly interpolated
as a function of the output probabilities of context-dependent
and context-independent senones. The weighting or interpolation
factor λ indicates the degree to which each senone is
interpolated.

Figure 8 depicts the operation of the speech recognition method.
Referring to Figure 8, the method commences by receiving an
input speech utterance (step 122), which is converted to feature
vectors (step 124), as previously detailed above with reference
to Figure 1. In step 126, the method performs steps 128-136 for
each word sequence that can represent the input speech
utterance. The word sequence can consist of a variety of
different senone alignments, where each senone alignment
corresponds to a sequence of HMM states. In steps 128-134, a
combined recognition score for each possible senone alignment
which can represent the word sequence is determined. The
combined recognition score can be determined in accord with the
modified Bayes formula as denoted above in equation (10). The
combined recognition score consists of an acoustic score and a
language score. The acoustic score is determined in step 130,
the language score is determined in step 132, and the combined
score is computed in step 134. The senone alignment having the
highest combined recognition score is then selected to represent
the word sequence (step 136). In step 138, the method recognizes
the input speech utterance as the word sequence having the
highest combined recognition score.

In step 130, the acoustic score can be determined as described
above in accord with equation (11), where the output probability
is computed as described above in equation (12).

In step 132, the method computes a language score based on the
language models representing linguistic expressions stored in
language model storage 32. Language models use knowledge of the
structure and semantics of a language in predicting the
likelihood of the occurrence of a word considering the words
that have been previously uttered. The language model can be a
bigram language model, where the language score is based on the
probability of one word being followed by a particular second
word. Alternatively, the language model may be based on N-grams
other than bigrams or even on subword language probabilities. In
addition, other lexical knowledge such as syntax and grammatical
rules can be employed to create the language model. Methods for
creating and using language models are well-known in the art and
are described in more detail in the Huang et al. book referred
to above.
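
As a rough illustration of a bigram language score, the sketch below sums log probabilities of each word given its predecessor. The probability table, the `<s>` start marker, and the function name are assumptions; smoothing for unseen word pairs is outside the scope of the sketch.

```python
import math
from typing import Dict, Sequence, Tuple


def bigram_log_score(
    words: Sequence[str],
    bigram_prob: Dict[Tuple[str, str], float],
    start: str = "<s>",
) -> float:
    """log P(W) under a bigram model: the sum over the word sequence of
    log P(word | previous word). Raises KeyError for an unseen pair."""
    log_p = 0.0
    prev = start
    for w in words:
        log_p += math.log(bigram_prob[(prev, w)])
        prev = w
    return log_p
```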

The above detailed invention improves the recognition capability
of a speech recognition system by utilizing multiple continuous
density output probabilities corresponding to the same speech
event in different contexts. This improves the mapping of the
feature vectors to the hidden Markov models since it improves
the model's performance in predicting speech events that the
model was not trained with. An improvement at this level is
extremely beneficial since the mapping at this level is the
foundation on which the rest of the recognition process builds.

However, it should be noted that this invention is not
restricted to a speech recognition system. Any application which
requires the matching of a speech utterance to a linguistic
expression can utilize the claimed invention. The speech
utterance can be any form of acoustic data, such as, but not
limited to, sounds, speech waveforms, and the like. An example
of such an application is a speech synthesis system which
utilizes probabilistic models to generate a speech waveform from
a text string representing a linguistic expression.

Although the preferred embodiment of the invention 
has been described hereinabove in detail, it is desired 
to emphasize that this is for the purpose of illustrating 
the invention and thereby to enable those skilled in this 
art to adapt the invention to various different applica- 
tions requiring modifications to the apparatus described 
hereinabove; thus, the specific details of the disclosures 
herein are not intended to be necessary limitations on 
the scope of the present invention other than as 
required by the prior art pertinent to this invention. 

Claims 

1. A method in a computer system for matching an input speech
utterance to a linguistic expression, the method comprising the
steps of:

for each of a plurality of phonetic units of speech, providing a
plurality of more-detailed acoustic models and a less-detailed
acoustic model to represent the phonetic unit, each acoustic
model having a plurality of states followed by a plurality of
transitions, each state representing a portion of a speech
utterance occurring in the phonetic unit at a certain point in
time and having an output probability indicating a likelihood of
a portion of an input speech utterance occurring in the phonetic
unit at a certain point in time;

for each of select sequences of more-detailed acoustic models,
determining how close the input speech utterance matches the
sequence, the matching further comprising the step of:

for each state of the select sequence of more-detailed acoustic
models, determining an accumulative output probability as a
combination of the output probability of the state and a same
state of the less-detailed acoustic model representing the same
phonetic unit; and

determining the sequence which best matches the input speech
utterance, the sequence representing the linguistic expression.

2. A method as in claim 1 where each acoustic model 
is a continuous density hidden Markov model. 

3. A method as in claim 1 wherein the step of determining the
output probability further comprises the step of weighing the
less-detailed model and more-detailed model output probabilities
with separate weighting factors when combined.

4. A method as in claim 1 wherein the step of providing a
plurality of more-detailed acoustic models further comprises the
step of training each acoustic model using an amount of training
data of speech utterances; and

wherein the step of determining the output probability further
comprises the step of weighing the less-detailed model and
more-detailed model output probabilities relative to the amount
of training data used to train each acoustic model.

5. A method in a computer system for determining a likelihood of
an input speech utterance matching a linguistic expression, the
input speech utterance comprising a plurality of feature vectors
indicating acoustic properties of the utterance during a given
time interval, the linguistic expression comprising a plurality
of senones indicating the output probability of the acoustic
properties occurring at a position within the linguistic
expression, the method comprising the steps of:

providing a plurality of context-dependent senones;

providing a context-independent senone associated with the
plurality of context-dependent senones representing a same
position of the linguistic expression;

providing a linguistic expression likely to match the input
speech utterance;

for each feature vector of the input speech utterance,
determining the output probability that the feature vector
matches the context-dependent senone in the linguistic
expression which occurs at the same time interval as the feature
vector, the output probability determination utilizing the
context-independent senone associated with the context-dependent
senone; and

utilizing the output probabilities to determine the likelihood
that the input speech utterance matches the linguistic
expression.

6. A method as in claim 5 wherein the output probability
comprises a continuous probability density function.

7. A method as in claim 5 wherein the step of providing a
plurality of context-dependent senones further comprises the
step of training the context-dependent senones from an amount of
training data representing speech utterances;

wherein the step of providing a context-independent senone
further comprises the step of training the context-independent
senones from the amount of training data; and

wherein the step of determining the output probability further
comprises the step of combining the context-independent and
context-dependent senones in accord with the amount of training
data used to train the senones.

8. A method as in claim 5 wherein the step of providing a
plurality of context-dependent senones further comprises the
steps of:

training the context-dependent senones from an amount of
training data representing speech utterances;

providing a weighting factor for each context-dependent senone
representing the amount of training data used to estimate the
senone; and

wherein the step of determining the output probability further
comprises the step of combining the context-dependent senone and
context-independent senone in accord with the weighting factor.

9. A method as in claim 8 wherein the step of provid- 
ing a weighting factor further comprises the step of 
generating the weighting factor by using a deleted 
interpolation technique on the amount of training 
data. 

10. A method as in claim 8 wherein the step of provid- 
ing a weighting factor further comprises the steps 
of: 

producing a parametric representation of the 
training data; and 

generating the weighting factor by applying a 
deleted interpolation technique to the paramet- 
ric representation of the amount of training 
data. 

11. A method as in claim 8 wherein the step of provid- 
ing a weighting factor further comprises the steps 
of: 

producing a parametric representation of the 
training data; 

providing a set of data points from the paramet- 
ric representation of the training data, the data 
points representing the training data; and 

generating the weighting factor from the application of deleted
interpolation to the data points.

12. A method in a computer readable storage medium for
recognizing an input speech utterance, said method comprising
the steps of:

training a plurality of context-dependent continuous density
hidden Markov models to represent a plurality of phonetic units
of speech, the training utilizing an amount of training data of
speech utterances representing acoustic properties of the
utterance during a given time interval, each model having states
connected by transitions, each state representing a portion of
the phonetic unit and having an output probability indicating a
probability of an acoustic property of a speech utterance
occurring within a portion of the phonetic unit;

providing a context-independent continuous density hidden Markov
model for the plurality of context-dependent continuous density
hidden Markov models representing the same phonetic unit of
speech;

providing a plurality of sequences of the context-dependent
models, each sequence representing a linguistic expression;

for each sequence of the context-dependent models, determining
an acoustic probability of the acoustic properties of the input
speech utterance matching the states in the sequence of the
context-dependent models, the acoustic probability comprising
the output probability of each state of each context-dependent
model in the sequence and the output probability of the
context-independent model corresponding to a same phonetic unit;
and

utilizing the acoustic probability to recognize the linguistic
expression which closely matches the input speech utterance.

13. A method as in claim 12, further comprising the 
step of providing a weighting factor for each state of 
the context-dependent models, the weighting factor 
indicating the amount of training data used to train 
the output probability associated with each state; 
and 

wherein the step of determining an acoustic 
probability further comprises the step of weighing 
the output probability of the state of the context- 
dependent model and the state of the context-inde- 
pendent model based on the weighting factor. 

14. A method as in claim 13 wherein the step of providing a
weighting factor further comprises the step of deriving the
weighting factor from an application of deleted interpolation to
the amount of training data.

15. A method as in claim 13 wherein the step of provid- 
ing a weighting factor further comprises the steps 
of: 

producing a parametric representation of the training data; and

deriving the weighting factor from an application of deleted
interpolation to the parametric representation of the training
data.



16. A method as in claim 13 wherein the step of providing a
weighting factor further comprises the steps of:

producing a parametric representation of the training data;

generating a set of data points from the parametric
representation of the training data; and

deriving the weighting factor from an application of deleted
interpolation to the parametric representation of the training
data.

17. A computer system for matching an input speech utterance to
a linguistic expression, comprising:

a storage device for storing a plurality of context-dependent
and context-independent acoustic models representing respective
ones of phonetic units of speech, the plurality of
context-dependent acoustic models which represent each phonetic
unit having at least one associated context-independent acoustic
model representing the phonetic unit of speech, each acoustic
model comprising states having transitions, each state
representing a portion of the phonetic unit at a certain point
in time and having an output probability indicating a likelihood
of a portion of the input speech utterance occurring in the
phonetic unit at a certain point in time;

a model sequence generator which provides select sequences of
context-dependent acoustic models representing a plurality of
linguistic expressions likely to match the input speech
utterance;

a processor for determining how well each of the sequences of
models matches the input speech utterance, the processor
matching a portion of the input speech utterance to a state in
the sequence by utilizing an accumulative output probability for
each state of the sequence, the accumulative output probability
including the output probability of each state of the
context-dependent acoustic model combined with the output
probability of a same state of the associated
context-independent acoustic model; and

a comparator to determine the sequence which best matches the
input speech utterance, the sequence representing the linguistic
expression.

18. A system as in claim 17 wherein each acoustic model is a
continuous density hidden Markov model.

19. A system as in claim 17, further comprising:

a training device to receive an amount of training data of
speech utterances and to estimate the output probability for
each state of each acoustic model with the amount of training
data; and

wherein the processor further comprises a combining element to
determine the accumulative output probability of each state, the
combining element combining the output probability of each state
of the sequence with the output probability of a same state of
the associated context-independent acoustic model relative to
the amount of training data used to estimate each output
probability.

20. A system as in claim 17, further comprising:

a training device to receive an amount of training data of
speech utterances used to estimate the output probability for
each state of each acoustic model with the amount of training
data, the training device generating a weighting factor for each
state of each context-dependent acoustic model indicating a
degree to which the output probability can predict speech
utterances not present in the training data; and

wherein the processor further comprises a combining element to
determine the accumulative output probability of a state, the
combining element combining the output probability of each state
of the sequence with the output probability of a same state of
the associated context-independent acoustic model relative to
the weighting factor for each state.

21. A system as in claim 20 wherein the weighting factor is
derived by applying a deleted interpolation technique to the
amount of the training data.

22. A system as in claim 20 wherein the training device further
comprises a parametric generator to generate a parametric
representation of the training data; and

wherein the weighting factor is derived by applying a deleted
interpolation technique to the parametric representation of the
training data.

23. A system as in claim 20 wherein the training device further
comprises:

a parametric generator to produce a parametric representation of
the training data;

a data generator to generate a set of data points from the
parametric representation; and

wherein the weighting factor is derived by applying a deleted
interpolation technique to the set of data points.